When Type 2 Processing Misfires: The Indiscriminate Use of Statistical Thinking about Reasoning Problems

Research on dual-process theories of judgment makes abundant use of reasoning problems that present a conflict between Type 1 intuitive responses and Type 2 rule-based responses. However, in many of these reasoning tasks, there is no way to discriminate between the adequate and inadequate use of rules based on logical or probabilistic principles. To experimentally discriminate between the two, we developed a new set of problems: rule-inadequate versions of standard base-rate problems (where base rates are made irrelevant). Across four experiments, we observed conflict sensitivity (measured in terms of response latencies and response confidence) in responses to standard base-rate problems but also in responses to rule-inadequate versions of these problems. This failure to discriminate between real and merely apparent (or spurious) conflict suggests that participants often misuse statistical information and draw conclusions based on irrelevant base rates. We conclude that inferring the sound use of statistical rules from normatively correct responses to standard conflict problems may be unwarranted when this kind of reasoning bias is not controlled for.


Introduction
One of the pioneering contributions of the initial work of the research program on heuristics and biases (Kahneman 1974, 1983; Kahneman et al. 1982) was the careful development of several reasoning tasks that were used to demonstrate that, more often than not, people's judgments seem to violate basic principles of probability and logic. One of these tasks was the so-called "lawyer-engineer problem" (Kahneman and Tversky 1972):
In a study, several psychologists interviewed a group of people. The group included 5 engineers and 995 lawyers. The psychologists prepared a brief summary of their impression of each interviewee. The following description was drawn randomly from the set of descriptions:
Dan is 45. He is conservative, careful, and ambitious. He shows no interest in political issues and spends most of his free time on his many hobbies, which include carpentry, sailing, and mathematical puzzles.
Which of the following is more likely?

(a) Dan is an engineer (b) Dan is a lawyer
When facing the problem above, many people might rely on the similarity between Dan's description and the stereotype of engineer to infer that Dan is an engineer (without taking into account the prior odds of being an engineer or a lawyer; 5/995). In other words, people's judgments often rely on intuitive processes (e.g., judgment by representativeness; Kahneman and Tversky 1972) rather than analytic ones (reasoning considering the initial base rates).
to stereotype-based information. In this case, false-alarm answers involved (inadequate) conclusions drawn from large but biased samples. This was crucial to show that even brief sessions of statistical training increased the use of the law of large numbers in standard problems without increasing its use in situations where it would be inappropriate to do so (i.e., rule-inadequate problems).
In a similar vein, Klauer and Singmann (2013) developed "pseudo-syllogisms", which may be described as rule-inadequate versions of standard syllogism problems. Indeed, pseudo-syllogisms modify the logical status of the original syllogisms while keeping the superficial structure of the problems (formal features and contents) unchanged. The use of "pseudo-syllogisms" allowed Klauer and Singmann to reinterpret previous findings that people can intuitively detect the logicality even of difficult syllogisms (Morsanyi and Handley 2012). We will return to this point in the General Discussion.

How to Discriminate between the Suitable and Unsuitable Use of Normative Rules in the Base-Rate Neglect Task
To experimentally distinguish between reasoning that makes an appropriate use of rules and reasoning that uses rules in an indiscriminate fashion, we used the conflict-detection paradigm and invited participants to respond not only to standard (conflict and no-conflict) versions of base-rate problems but also to rule-inadequate versions of these problems. The latter present a conflict between stereotypical information and probabilistic information in the form of invalid or irrelevant base rates that do not provide a valid basis for decisions. That is to say, the conflict in rule-inadequate problems is only apparent or spurious.
To illustrate, take the well-known lawyer-engineer problem (Kahneman and Tversky 1972) presented in the beginning. A rule-inadequate version of the same problem would be as follows: In a study, several psychologists interviewed a group of people. The group included 5 engineers and 995 lawyers. In the first day of the study, an equal number of engineers and lawyers were interviewed, and the psychologists prepared a brief summary of their impression of each interviewee. The following description was drawn randomly from this set of descriptions: Dan is 45. He is conservative, careful, and ambitious. He shows no interest in political issues and spends most of his free time on his many hobbies, which include carpentry, sailing, and mathematical puzzles. Which of the following is more likely?

(a) Dan is an engineer (b) Dan is a lawyer
In this version, there is additional information indicating that the base rates (stemming from the group composition: 5 engineers and 995 lawyers) may not be reliable in the current context. For those such as Dan, who were tested on the first day, the prior probability of being an engineer or a lawyer is the same (i.e., 50/50). If participants go ahead and still use the initial base rates to respond, this would be an inappropriate use of the base-rate information.
Sound statistical thinking should discriminate between standard and rule-inadequate versions of base-rate problems, taking into consideration the 5/995 group composition in the former but discarding it in the latter (leaving Dan's description as the only source of diagnostic information).
In contrast, an indiscriminate reliance on the base rates of the group composition (5/995) is expected to lead to responding according to them both in the standard and rule-inadequate versions of the problem.
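The contrast between the two versions can be made concrete with Bayes' rule in odds form (posterior odds = prior odds × likelihood ratio). The following is a minimal sketch and not part of the reported studies: the likelihood ratio of 20 (how much more typical Dan's description is of engineers than of lawyers) is an arbitrary value assumed for illustration, and the function name is hypothetical.

```python
def posterior_prob_engineer(n_engineers, n_lawyers, likelihood_ratio):
    """Posterior P(engineer | description) via Bayes' rule in odds form.

    likelihood_ratio = P(description | engineer) / P(description | lawyer).
    """
    prior_odds = n_engineers / n_lawyers
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Assumed for illustration only: the description is 20 times more
# typical of engineers than of lawyers.
ASSUMED_LR = 20

# Standard version: the 5/995 group composition is the valid prior.
standard = posterior_prob_engineer(5, 995, ASSUMED_LR)

# Rule-inadequate version: Dan comes from a subsample with equal numbers
# of engineers and lawyers, so the valid prior is 50/50.
rule_inadequate = posterior_prob_engineer(1, 1, ASSUMED_LR)

print(f"standard: {standard:.3f}, rule-inadequate: {rule_inadequate:.3f}")
```

Under this assumed likelihood ratio, "lawyer" remains the more likely answer in the standard version, whereas "engineer" becomes the more likely answer in the rule-inadequate version. Applying the 5/995 prior to the second case would reproduce the indiscriminate use of base rates described above.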
Importantly, the conflict-detection paradigm further allowed us to look into the participants' metacognition, specifically whether participants show some sensitivity to the opposing information in standard conflict problems (e.g., De Neys and Glumicic 2008).
If participants do not take logical principles into account (e.g., neglecting base-rate information), then conflict should be irrelevant and have no impact on reasoning. However, previous findings indicate that biased reasoners often do show conflict sensitivity by displaying increased response doubt. This is reflected in longer response latencies and lower confidence in their incorrect answers to conflict problems compared to correct answers to no-conflict problems (De Neys 2012; De Neys and Glumicic 2008; Pennycook et al. 2015).
We expect to observe the same kind of conflict effects with rule-inadequate versions of base-rate problems as have been observed with standard base-rate problems. That is, we predict that the irrelevant base rates in rule-inadequate problems are going to trigger (inaccurate) logical intuitions that oppose other T1 intuitions (based on the similarity between the target's description and a professional/group stereotype). This may lead to the detection of a spurious conflict between invalid base rates and stereotype-based information that will be reflected by longer response latencies (Experiments 1 to 4) and lower response confidence (Experiment 4).
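The predicted latency pattern can be expressed as a simple difference score. The sketch below is illustrative only: the response times are made-up values for a single hypothetical participant, the function name is an assumption, and the difference score is not the analysis reported in the experiments. It only shows the direction of the predicted effect.

```python
from statistics import mean


def conflict_detection_effect(rt_conflict, rt_no_conflict):
    """Mean latency difference (same unit as the inputs).

    Positive values mean slower responses on conflict trials than on the
    no-conflict baseline, the pattern read as increased response doubt.
    """
    return mean(rt_conflict) - mean(rt_no_conflict)


# Hypothetical response times in seconds for a single participant.
rt_no_conflict_correct = [4.1, 3.8, 4.4, 4.0]
rt_conflict_stereotype = [5.2, 4.9, 5.6, 5.1]        # standard conflict
rt_rule_inadequate_base_rate = [5.0, 5.4, 4.8, 5.3]  # spurious conflict

standard_effect = conflict_detection_effect(
    rt_conflict_stereotype, rt_no_conflict_correct)
spurious_effect = conflict_detection_effect(
    rt_rule_inadequate_base_rate, rt_no_conflict_correct)

print(f"standard: {standard_effect:.3f} s, spurious: {spurious_effect:.3f} s")
```

A positive value for the rule-inadequate trials would correspond to the spurious conflict sensitivity predicted here.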
Finally, individual differences in reflective thinking ability have been shown to be associated with judgment biases and errors (e.g., Toplak et al. 2011). We thus used the cognitive reflection test (CRT; Frederick 2005) to assess participants' susceptibility to the overgeneralization bias. We expect higher reflective ability to be associated with less overgeneralization bias.

Experiment 1
Experiment 1 is a first attempt at examining the extent to which base-rate responses to standard problems stem from adequate statistical thinking or from an indiscriminate use of base rates (i.e., using the base rates both when they are valid and invalid). In order to obtain measures of processing time, participants completed a "moving window" paradigm (De Neys and Glumicic 2008; Just et al. 1982) with three kinds of problems: conflict and no-conflict versions of standard base-rate problems, and conflict versions of rule-inadequate problems, which present a spurious conflict between base rates and stereotype-based information. In this paradigm, the base-rate information and the stereotypical description are presented separately. Participants were first presented the base-rate information on a computer screen, which was then replaced by the stereotypical description and the question. Before giving their answer, participants had the option to view the base-rate information again by pressing and holding down a specific computer key.
De Neys and Glumicic (2008) showed that not only normatively correct responses but also incorrect responses to problems that presented a conflict between the description and the previously presented base rates took longer (and were associated with more revisits of the initial base-rate information) than responses to no-conflict problems (where base rates were congruent with the stereotype description), thus providing evidence consistent with conflict detection even for participants who gave the heuristic response.
Our main predictions are as follows. An indication of the indiscriminate use of base-rate information might be found in the form of a positive correlation between base-rate responses for standard and rule-inadequate problem versions. This would show that the more participants give base-rate responses to standard problems, the more they overgeneralize the use of base rates to cases where the base rates are invalid (i.e., rule-inadequate problems). Furthermore, detection of (a spurious) conflict between the stereotype-based information and irrelevant base rates (followed by unwarranted use of these base rates) should be apparent if base-rate responses to rule-inadequate problems are also coupled with longer response times and a higher frequency of base-rate reviews (when compared to no-conflict problems). In addition, participants' responses to standard versions of base-rate problems are expected to replicate De Neys and Glumicic's (2008) results.
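The first prediction amounts to a Pearson correlation computed across participants. As a minimal sketch, assuming each participant contributes one proportion of base-rate responses per problem version, it could be computed as follows. The data are invented for six hypothetical participants and do not reproduce any reported result.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical proportions of base-rate responses (out of five problems
# of each type) for six made-up participants.
standard_conflict = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
rule_inadequate = [0.0, 0.0, 0.2, 0.2, 0.6, 0.4]

r = pearson_r(standard_conflict, rule_inadequate)
print(f"r = {r:.2f}")
```

A positive r in such data means that participants who follow the base rates more often in standard problems also do so more often when the base rates are invalid.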

Participants
Eighty-six psychology undergraduate students (70 female; M age = 21.2, SD = 6.02) participated in the study in exchange for course credits. Sensitivity analysis with this sample size (at α = .05, and power = .80) showed that the experimental design could reliably detect medium or larger effect sizes (f ≥ .34).

Material
The material included 18 problems (translated and adapted to Portuguese from De Neys and Glumicic 2008). Three were presented during practice trials, while the remaining fifteen were used in the experimental task. The problems followed the common design of base-rate problems, with a description of the base rates for two groups of people and a description of characteristics of a subject selected randomly from those groups. The description was either stereotypical of the larger group (no-conflict problems) or of the smaller group (conflict problems). The original problems were changed to include information that rendered the described base rates irrelevant (rule-inadequate problems) or kept these base rates relevant (standard problems). The following is an example of one of the problems, with the text that renders the base rates irrelevant in italics: In a study 1000 people were tested. Among the participants there were 996 women and 4 men. In the first week of the study, as many men as women were interviewed. Jo was one of the people interviewed in the first week.
Jo is 23 years old and is finishing a degree in engineering. On Friday nights, Jo likes to go out cruising with friends while listening to loud music and drinking beer.
What is most likely?
a. Jo is a man b. Jo is a woman
In this case, the tested subsample comprises the same number of men and women, so the relevant base rates indicate a 50/50 chance (rather than the initial 996/4) for the described person to be a woman or a man.
For standard problems, the phrase in italics reads, "In the first week of the study, all men and women were interviewed." As such, the relevant base rate is still 996/4, making "b" ("Jo is a woman") the most likely correct response (examples for all variations of problems in Experiment 1 are displayed in Appendix B, Table A3).
The fifteen problems were separated into three sets of five problems each. In each set, the problems were all of one type: standard conflict problems, standard no-conflict problems, or rule-inadequate problems.
Problem content was counterbalanced across problem types and not repeated for an individual participant. Conflict and no-conflict versions of standard problems were exactly the same except for the base rates, which were inverted. Rule-inadequate versions were the same as conflict versions, except that the original base rates were rendered irrelevant with the addition of the subsample information (each participant responded to only one of the three versions of each specific problem). All problems presented extreme base rates with three variations between them (997/3, 996/4, or 995/5), which were assigned evenly between the fifteen problems.
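The exact counterbalancing scheme is not spelled out here, so the following is only one way such a design could be implemented: a rotation in which each set of five problem contents receives a different problem type in each of three lists. The list structure, the rotation, and the way base-rate ratios are cycled over contents are all assumptions made for illustration.

```python
PROBLEM_TYPES = ["standard conflict", "standard no-conflict", "rule-inadequate"]
BASE_RATES = ["997/3", "996/4", "995/5"]
N_PROBLEMS = 15


def build_list(rotation):
    """Assign each of the 15 problem contents to one type for one list.

    Contents are split into three sets of five (0-4, 5-9, 10-14); the
    rotation index shifts which problem type each set receives.
    """
    assignment = {}
    for content in range(N_PROBLEMS):
        set_index = content // 5
        problem_type = PROBLEM_TYPES[(set_index + rotation) % 3]
        base_rate = BASE_RATES[content % 3]  # five contents per ratio
        assignment[content] = (problem_type, base_rate)
    return assignment


# Three hypothetical counterbalancing lists, one per rotation.
lists = [build_list(rotation) for rotation in range(3)]
```

With this rotation, every participant sees five problems of each type and each content only once, while across the three lists every content appears in every problem type.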

Procedure
Participants performed the experiment on a computer, where they received instructions to read the texts and to solve the problems at their own pace. Participants were presented 3 practice trials.
Participants were informed that, in each trial, they would see the first part of the problem (containing the base-rate information). After reading the text and whenever they felt ready to move on, they should press the space key on the keyboard to replace the base-rate information with the second part of the problem (containing the stereotypical description of the selected individual and the response options). During the display of this second part, participants could again have access to and review the first part with the crucial base-rate information by pressing the space key. As long as they held down the space key, the first part remained visible. Once the space key was released, the information disappeared again. The second part of the text with the description always remained visible after the initial presentation.
The main dependent variables included the proportion of base-rate responses for each type of problem, the base-rate reading time (when this information was first presented without the stereotype information), and the decision-making time (i.e., the time elapsed between the display of stereotype information and response). The mean number of problems where base-rate information was reviewed was also analyzed for each type of problem.

Results
In this and the remaining experiments, we begin by describing and analyzing the proportion of base-rate responses across the experimental conditions (see Appendix B, Table A1), as well as the correlation of base-rate responses between standard and rule-inadequate versions of the conflict problems (see Table 1). Evidence for the overgeneralization bias is provided by positive and significant correlations between the two. Furthermore, to test for conflict detection, we then analyze response times (Experiments 1 to 4; see Appendix B, Table A2), mean base-rate reviewing (only Experiment 1), and response confidence (only Experiment 4) separately for standard and rule-inadequate problems. If increased response doubt occurs not only for standard but also for rule-inadequate problems, this will be an indication of sensitivity to a spurious conflict between response outputs based on stereotypical descriptions and opposing (but irrelevant) base rates (see Table 2 for a summary of the conflict detection results).
Figure 1 presents the mean proportion of responses according to the base rates in no-conflict problems, and in standard and rule-inadequate versions of conflict problems.
Differences in responses across problems were significant (F(2, 170) = 169.36, p < .001, ηp² = .67). Almost all responses were according to the base rates (and stereotype-based information) in no-conflict problems (M = .97; SE = .03). About half of the responses to conflict problems (M = .46; SE = .03) and one-fourth of the responses to rule-inadequate problems (M = .26; SE = .03) were according to base rates. This greater reliance on valid (compared to invalid) base rates may indicate some sensitivity to the quality of the base-rate information at the aggregate level of analysis.
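As a consistency check, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the omnibus test reported above; the function name is ours and is not taken from any analysis script of the study.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Omnibus test reported above: F(2, 170) = 169.36.
eta_p2 = partial_eta_squared(169.36, 2, 170)
print(round(eta_p2, 2))  # 0.67, matching the reported value
```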
However, and as predicted, the correlation between standard and rule-inadequate problems was positive (r = .46, p < .001), such that the more participants decided according to base rates in standard problems, the more they did so in rule-inadequate problems. This suggests an overgeneralization bias in the use of base-rate information.

Base-Rate Reading Time
The initial base-rate reading time (i.e., the time people initially spent reading the first part of the problem before the description was presented) did not vary for the three types of decisions (F < 1), which indicates that the initial presentation of the base-rate information was not processed differently across problem versions.

Base-Rate Reviewing for Standard Problems
A repeated-measures ANOVA on the mean number of problems reviewed, comparing correct responses to no-conflict problems with base-rate and stereotype-based responses to standard conflict problems, did not reach significance (F(2, 90) = 2.59, p = .080, ηp² = .05). The tendency to review base-rate information was higher for base-rate responses to conflict problems (M = .43, SE = .06) than for correct responses to no-conflict problems (M = .33, SE = .49), but this difference was also non-significant (F(1, 45) = 3.35, p = .074, ηp² = .06). There was no difference in reviewing between stereotype-based responses to conflict problems (M = .31, SE = .05) and correct responses to no-conflict problems (F < 1).

Discussion
A larger number of participants used base rates when they were a valid source of information (standard problems) than when they were not (rule-inadequate problems), suggesting some discrimination between relevant and irrelevant base-rate information. Nevertheless, the mean frequency of responses in the latter category (26%) is non-negligible. Importantly, the positive correlation between base-rate responses to standard and rule-inadequate problems is congruent with an overgeneralization bias in the use of base-rate information. Indeed, those participants who relied more on base rates to respond to standard problems also relied more on base rates to respond to rule-inadequate problems.
In addition, response latencies (but not review frequency) provided some indication of conflict detection and possible engagement of T2 reasoning for responses to standard problems, whereas review frequency (but not response latencies) provided some indication of conflict detection and possible engagement of T2 for responses to rule-inadequate problems (see Table 2).
Taken together, these initial findings provide preliminary indications of an overgeneralization bias in the use of base rates and may be seen as questioning the extent to which base-rate responses to the base-rate problems rely on the proper use of the statistical information (as is often assumed).
After having established the basic phenomenon, we sought to increase the number of observations that we could make per person, as well as to potentially reduce the large variability in response time that arose because of the lengthy text passages that we used. Indeed, response times in classic base-rate problems tend to be quite noisy, with the mean RT range varying between 3.4 s and 21.80 s. This variability may have made it difficult to detect the subtle conflict effects.
Furthermore, standard no-conflict problems were used as a baseline for standard conflict and rule-inadequate problems. A better baseline for the latter would be no-conflict versions of rule-inadequate problems, which were included in the following experiments.

Experiment 2
Experiment 2 used a rapid-response base-rate paradigm (Pennycook et al. 2015). In this paradigm, base rates and stereotype-based information are presented in a simplified manner in a fast-paced sequence of computer screens.
On the first screen, the two groups involved in the problem were presented (e.g., "politicians and nannies"); the second screen presented an attribute (e.g., "kind") pertaining to the target subject who was randomly drawn from the total sample of the two groups (e.g., "politicians and nannies"); the third screen presented the groups' composition (e.g., "995 politicians and 5 nannies"); and the fourth screen presented the question and response options (e.g., "is the target person more likely to be a politician or a nanny?").
This paradigm has been shown to be sensitive to conflict detection and subtle increases in T2 thinking (Pennycook et al. 2015) and allows us to substantially increase the number of trials (base-rate problems) per participant.
Furthermore, whereas Experiment 1 used three types of problems (i.e., conflict and no-conflict versions of standard problems and conflict versions of rule-inadequate problems), this experiment and the following ones (i.e., Experiments 3 and 4) also included no-conflict versions of rule-inadequate problems. No-conflict rule-inadequate problems have the same structure as the conflict rule-inadequate problems (i.e., inclusion of additional information that makes the initial base rates irrelevant). However, the initial base rates are aligned with the stereotype-based information. For an outline of these different kinds of problems (standard and rule-inadequate versions of conflict and no-conflict problems), see Table 3. For examples of the problems, see Appendix B, Figure A1.
As in Experiment 1, we predict a positive correlation between base-rate responses for standard and rule-inadequate problem versions (i.e., the more participants give base-rate responses to standard problems, the more they overgeneralize the use of base rates to cases where the base rates are invalid). An unwarranted reliance on irrelevant base rates should also be apparent if base-rate responses to rule-inadequate problems are coupled with longer response times (when compared to no-conflict problems), suggesting the detection of a spurious conflict.
Finally, in Experiments 2 to 4, the CRT was included as a measure of participants' analytical skills. Performance in the CRT should predict a more appropriate use of base-rate information.

Participants
One hundred sixteen students from the University of Heidelberg (89 female) performed the experiment for course credits. Sensitivity analysis with this sample size (at α = .05, and power = .80) showed that the experimental design could reliably detect medium or larger effect sizes (f ≥ .31).

Material
Sixty-six pairs of social groups, with opposite stereotype personality traits associated with each group pair, were obtained through pre-testing. In the pre-test, an independent sample of 40 students was presented with a sample of groups (including professions and sociodemographic categories) and a sample of personality traits. Participants were asked to indicate the two traits most stereotypical of each group. The groups were subsequently paired, and two personality traits were selected so that one trait was frequently associated with one group and rarely (if ever) associated with the other, and vice versa for the other selected trait. Twenty-two pairs were presented in the rule-inadequate version and forty-four pairs were presented in the standard version. In each version, half were conflict trials, and the other half were no-conflict trials. Content was counterbalanced across problem type by creating two lists of trials that varied only in the way the base rates were presented, so that pairs of a conflict type of problem in one list were of a no-conflict type of problem in the other, and vice versa. Participants were randomly presented with one of the two lists of problems.

Procedure
The procedure was based on Pennycook et al.'s (2015) rapid-response base-rate task, with minor adjustments to accommodate the inclusion of rule-inadequate versions of both no-conflict and conflict problems. Specifically, the base rates were qualified by further information referring to how many people from each of the two groups (i.e., initial base rates) were actually tested: the whole sample (in the case of standard trials) or a subsample composed of an equal number of individuals from each of the two groups (in the case of rule-inadequate trials). Participants were also informed that the individual described in each trial was randomly selected from the tested sample.
Participants began by reading the following instructions taken from Pennycook et al. (2015; text in italics corresponds to additional sentences introduced to accommodate the use of rule-inadequate problems): In a big research project, a large number of studies were carried out where short personality descriptions of the participants were made. In every study, there were participants from two population groups (e.g., carpenters and policemen). In each study, one participant was drawn at random from a sample of tested participants. In each trial, you will be presented with a fixation dot where you should look at. After a few seconds, this dot will be replaced by a personality trait for the randomly chosen participant and finally by some information about the composition of the population groups and how many participants from each group were tested in the study in question. After that, you will be asked to indicate to which population group the participant most likely belongs. Please answer the problems as quickly and accurately as possible. Once you have made up your mind, you must enter your answer ('a' or 'b') immediately and then the next problem will be presented.
Following Pennycook et al. (2015), each trial began with the presentation of a fixation point (500 ms), followed by (a) the information concerning the groups involved in the trial (2000 ms), (b) the number of individuals actually tested (2000 ms), (c) the trait describing one of the individuals tested (2000 ms), and (d) the groups' base rates (2000 ms). Participants were then asked to which group the individual most likely belonged (see Figure 2).
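The trial sequence just described can be written down as a simple timeline. This is a sketch for orientation only: it assumes the screens follow one another without any blank interval, which the text does not state, and the names used are hypothetical.

```python
# Event durations in milliseconds, in the order listed above.
TRIAL_SEQUENCE = [
    ("fixation point", 500),
    ("groups involved in the trial", 2000),
    ("number of individuals actually tested", 2000),
    ("trait describing the target individual", 2000),
    ("base rates of the two groups", 2000),
]


def onset_times(sequence):
    """Return (event, onset_ms) pairs plus the onset of the response screen."""
    onsets, elapsed = [], 0
    for event, duration in sequence:
        onsets.append((event, elapsed))
        elapsed += duration
    return onsets, elapsed


onsets, question_onset = onset_times(TRIAL_SEQUENCE)
print(question_onset)  # 8500
```

Under the no-gap assumption, the question appears 8.5 s after trial onset, and response time is measured from that point.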

Participants began by responding to three practice trials, after which they responded to the block of sixty-six trials (each trial corresponding to a different group pair). After the experimental task, participants responded to the CRT (see Appendix A).
Dependent measures included base-rate responses for each type of problem, as well as response time (RT) of correct responses to no-conflict problems and of stereotype-based and base-rate responses to conflict problems.

Base-Rate Responses
Three participants with accuracies below .80 in the no-conflict standard trials were removed from the analysis (see Pennycook et al. 2015). The mean proportion of base-rate responses for conflict and no-conflict problems for standard and rule-inadequate problems is presented in Figure 3.
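The exclusion rule can be stated as a one-line filter. The sketch below uses invented accuracies and hypothetical participant identifiers; only the .80 cutoff comes from the text.

```python
ACCURACY_CUTOFF = 0.80  # criterion stated above


def retained_participants(no_conflict_standard_accuracy, cutoff=ACCURACY_CUTOFF):
    """Keep participants whose accuracy is at or above the cutoff."""
    return {
        pid: acc
        for pid, acc in no_conflict_standard_accuracy.items()
        if acc >= cutoff
    }


# Hypothetical accuracies on no-conflict standard trials.
accuracy = {"p01": 0.95, "p02": 0.77, "p03": 0.80, "p04": 1.00}
kept = retained_participants(accuracy)
print(sorted(kept))
```

Removing three of the 116 participants leaves 113, which is consistent with the error degrees of freedom (112) of the within-participant tests reported for this experiment.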

A 2 × 2 ANOVA, with trial version (standard and rule-inadequate) and trial type (conflict and no-conflict) as within-participant factors, and responses based on base rates as the dependent variable, showed a significant main effect of version (F(1, 112) = 85.65, p < .001, ηp² = .43), with a higher frequency of base-rate responses for standard problems (M = .86, SE = .02) than for rule-inadequate ones (M = .73, SE = .02). A significant main effect of trial type was also observed (F(1, 112) = 211.78, p < .001, ηp² = .65), with a higher proportion of base-rate responses for no-conflict problems (M = .96, SE = .02) than conflict ones (M = .62, SE = .02). Additionally, there was a significant interaction between the factors (F(1, 112) = 50.75, p < .001, ηp² = .31), indicating that the difference between standard and rule-inadequate problems was larger for conflict problems (M = .74, SE = .03 for standard problems, and M = .51, SE = .03 for rule-inadequate problems) than for no-conflict problems (M = .98, SE < .01 for standard problems, and M = .95, SE = .01 for rule-inadequate problems; see Appendix B, Table A1). Overall, the greater reliance on valid (compared to invalid) base rates indicates some sensitivity to the quality of the base-rate information at the aggregate level of analysis.
However, there was a large positive correlation between standard and ruleinadequate trial versions (r = 50; see Table 1), indicating that the more participants decided A 2X2 ANOVA, with trial version (standard and rule-inadequate) and trial type (conflict and no-conflict) as within-participant factors, and responses based on base rates as the dependent variable, showed a significant main effect of version (F(1, 112) = 85.65, p < .001, η p 2 = .43), with a higher frequency of base-rate responses for standard problems (M = .86, SE = .02) than for rule-inadequate ones (M = .73, SE = .02). A significant main effect of trial type was also observed (F(1, 112) = 211.78, p < .001, η p 2 = .65), with a higher proportion of base-rate responses for no-conflict problems (M = .96, SE = .02) than conflict ones (M = .62, SE = .02). Additionally, there was a significant interaction between the factors (F(1, 112) = 50.75, p < .001, η p 2 = .31), indicating that the difference between standard and rule-inadequate problems was larger for conflict problems (M = .74, SE = .03 for standard problems, and M = .51, SE = .03 for rule-inadequate problems) than for no-conflict problems (M = .98, SE < .01 for standard problems, and M = .95, SE = .01 for rule-inadequate problems (see Appendix B, Table A1).
Overall, the greater reliance on valid (compared to invalid) base rates indicates some sensitivity to the quality of the base-rate information at the aggregate level of analysis.
However, there was a large positive correlation between standard and rule-inadequate trial versions (r = .50; see Table 1): the more participants decided according to base rates in standard trials, the more they did so in rule-inadequate trials, which suggests an overgeneralization bias in the use of base-rate information.
Furthermore, to verify whether more reflective participants were better able to discriminate between trials where base rates were a relevant source of information (standard trials) and trials where they were not (rule-inadequate trials), participants were divided into two subgroups based on their CRT performance.
The low-reflective reasoners included participants who gave one or more incorrect responses to the CRT (N = 70) and the high-reflective reasoners included participants who responded correctly to all three CRT problems (N = 43).
The rationale for this partition was that those participants who respond to all problems correctly were inhibiting and replacing the highly appealing intuitive responses with the analytical correct responses in a consistent way (whereas the remaining participants did it inconsistently or not at all across the three CRT problems). As such, these are the participants with better chances to discriminate between standard and rule-inadequate base-rate problems (i.e., whether or not to inhibit the stereotype-based response).
For low-reflective reasoners, we found a strong positive correlation between base-rate responses to conflict standard and rule-inadequate trials (r = .70), whereas for high-reflective reasoners there was no correlation (r = .05). High-reflective participants showed a higher frequency of base-rate responses for standard conflict problems compared to rule-inadequate conflict problems (difference of .30) than low-reflective participants (.19). See Table 4 for details.
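The subgroup analysis described above can be sketched in a few lines of Python. This is only an illustration: the field names (`crt_correct`, `standard`, `rule_inadequate`), the helper functions, and the data are hypothetical and are not taken from the study's materials or data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def split_by_crt(participants, n_items=3):
    """Low-reflective: at least one CRT error; high-reflective: all items correct."""
    low = [p for p in participants if p["crt_correct"] < n_items]
    high = [p for p in participants if p["crt_correct"] == n_items]
    return low, high


def subgroup_correlation(group):
    """Correlate base-rate response proportions across the two conflict versions."""
    return pearson_r(
        [p["standard"] for p in group],
        [p["rule_inadequate"] for p in group],
    )


# Hypothetical proportions of base-rate responses on conflict trials.
participants = [
    {"crt_correct": 0, "standard": 0.9, "rule_inadequate": 0.8},
    {"crt_correct": 1, "standard": 0.5, "rule_inadequate": 0.4},
    {"crt_correct": 2, "standard": 0.7, "rule_inadequate": 0.7},
    {"crt_correct": 3, "standard": 0.9, "rule_inadequate": 0.2},
    {"crt_correct": 3, "standard": 0.8, "rule_inadequate": 0.3},
    {"crt_correct": 3, "standard": 1.0, "rule_inadequate": 0.1},
]

low, high = split_by_crt(participants)
print("low-reflective r:", round(subgroup_correlation(low), 2))
print("high-reflective r:", round(subgroup_correlation(high), 2))
```

A strong positive correlation within a subgroup is what an overgeneralization bias would look like in this kind of data: participants who follow valid base rates also tend to follow irrelevant ones.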

Response Times
Response times (converted to log10) were analyzed separately for standard and rule-inadequate trials (see Appendix B, Table A2 for means in ms).
Standard trials. A repeated-measures one-way ANOVA, with base-rate responses to no-conflict trials, and base-rate and stereotype-based responses to conflict trials, yielded a significant main effect (F(2, 166) = 41.94, p < .001, ηp² = .34). Planned comparisons showed that the RT for no-conflict trials (M = 2.96, SE = .02) was faster than base-rate responses to standard conflict trials (M = 3.08, SE = .03; F(1, 83) = 40.18, p < .001, ηp² = .33) and faster than stereotype-based responses to standard conflict trials (M = 3.17, SE = .03; F(1, 83) = 86.40, p < .001, ηp² = .51). This pattern of results replicates previous findings by Pennycook et al. (2015), suggesting that responding according to base rates involved the engagement in time-consuming, deliberate reasoning. Furthermore, stereotype-based responses to conflict trials were also slower than to no-conflict trials, providing evidence consistent with conflict detection (between stereotype and base-rate information) for participants who ended up giving the heuristic response.
Rule-inadequate trials. A repeated-measures ANOVA, with base-rate responses for no-conflict trials, and base-rate and stereotype-based responses for conflict trials, yielded a significant main effect (F(2, 186) = 13.54, p < .001, ηp² = .13). Planned comparisons showed that responses to no-conflict trials (M = 3.02, SE = .03) were faster than base-rate responses to rule-inadequate conflict trials (M = 3.16, SE = .04; F(1, 93) = 24.78, p < .001, ηp² = .21) and faster than stereotype-based responses to rule-inadequate conflict trials (M = 3.14, SE = .03; F(1, 93) = 20.01, p < .001, ηp² = .18). In other words, the response-time pattern that was obtained for the standard trials was also obtained for rule-inadequate trials. This suggests that responding according to invalid base rates also involved conflict detection and possible engagement in T2 thinking. Furthermore, there is also indication that participants detected a conflict between traits and opposite (but invalid) base-rate information, even when they opted for the heuristic response.
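To give a sense of scale for the log-transformed means reported in this section, the following minimal sketch converts them back to milliseconds. It assumes that the logarithm was taken of response times expressed in ms; the function names are illustrative. Note that back-transforming a mean of logs yields a geometric mean, so these values need not match the arithmetic means listed in Table A2.

```python
import math


def to_log10(rt_ms):
    """Log-transform a response time given in milliseconds."""
    return math.log10(rt_ms)


def from_log10(log_rt):
    """Back-transform a log10 response time to milliseconds."""
    return 10 ** log_rt


# Log10 means reported for the standard trials of Experiment 2.
standard_means = {
    "no-conflict": 2.96,
    "conflict, base-rate response": 3.08,
    "conflict, stereotype-based response": 3.17,
}

for label, log_mean in standard_means.items():
    print(f"{label}: about {from_log10(log_mean):.0f} ms")
```

On this reading, the reported difference between no-conflict trials and base-rate responses to conflict trials corresponds to roughly 0.3 s on the back-transformed scale.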

Discussion
The number of participants who responded according to the base rates when they were a valid source of information (standard problems) was larger than when the base rates were made irrelevant (rule-inadequate problems). This suggests some discrimination between relevant and irrelevant base-rate information. However, the mean frequency of responses in the latter category (51%) is considerable. In addition, the positive correlation between base-rate responses to standard and rule-inadequate problems is congruent with an overgeneralization bias in the use of base-rate information.
The positive correlation in the tendency to respond according to base rates observed between standard and rule-inadequate conflict problems was, however, weaker for highly reflective participants compared to less reflective participants. This could suggest individual differences in the expected direction concerning the relation between rationality (as measured by the CRT) and the discriminate use of statistical information.
Response-time analysis of standard problems provided evidence of conflict detection, both when participants gave the stereotype-based response and when they responded according to the base rates. The same results pattern emerged for rule-inadequate problems, where there is no real conflict between statistical information and stereotype-based information. Taken together, these results suggest that participants failed to discriminate between real and merely apparent (or spurious) conflict and misused statistical information to draw conclusions based on irrelevant base rates.

Experiment 3
Experiment 2 presented the group compositions last, as in the original studies with this rapid-response paradigm (Pennycook et al. 2015, Experiments 1 to 3). However, previous research has shown that presenting a piece of information last, just before judgment, increases the likelihood of its use (Krosnick et al. 1990; Pennycook et al. 2015, Experiment 4). It is thus possible that including the subsample (vs. the group composition) information as the last piece of information prior to judgment would increase the adequate use of base rates. In Experiments 3 and 4, the information about the subsample actually tested and the group compositions was presented in the last two screens (before the screen with the question) and its order of presentation was manipulated.

Participants
Eighty-six students from the University of Heidelberg (71 female) completed the experiment in exchange for course credit. Sensitivity analysis with this sample size (at α = .05, and power = .80) showed that the experimental design could reliably detect medium or larger effect sizes (f ≥ .36).

Material
Forty-four pairs of social groups were selected from the material used in Experiment 2, with opposite stereotypical traits associated with each group pair. Half of the pairs were assigned to standard problems and the other half to rule-inadequate problems. Both kinds of problems had conflict and no-conflict versions (obtained by changing the personality trait associated with the trial).

Procedure
The procedure was the same as in Experiment 2. However, we manipulated the order of presentation of two pieces of information in each trial: the base rates and the number of people from each of the two groups who were actually tested, that is, the whole sample (in standard base-rate problems) or a subsample composed of an equal number of individuals from each of the two groups (in rule-inadequate problems). Participants either saw the base rates followed by the subsample of individuals who were actually tested or the other way around (subsample information followed by the base rates from which the subsample was taken) (see Figure 4).

Participants began by responding to three practice trials, after which they responded to 2 blocks of 44 trials (each trial corresponding to a different group pair). Between blocks, the pairs were the same, but the trait presented for each pair varied so that a conflict problem in one block would be a no-conflict problem in the other block and vice versa. After the experimental task, participants responded to the CRT.

Base-Rate Responses
One participant with an accuracy below .80 in the no-conflict standard problems was removed from the analysis. The mean proportion of base-rate responses for conflict and no-conflict problems for standard and rule-inadequate problems is presented in Figure 5.
In sum, the analysis of base-rate responses replicated that of Experiment 2. Furthermore, when the subsample of participants actually tested was the last information presented, it reduced participants' reliance on irrelevant base rates (in rule-inadequate problems).
Regardless, the proportions of base-rate responses to standard and rule-inadequate conflict problems were positively and strongly correlated (r = .53, p < .001), indicating an overgeneralization bias (see Table 1). The same positive correlation was observed for participants in the subsample-last condition (r = .40, p = .008) and in the base-rates-last condition (r = .65, p < .001).
To verify whether more reflective participants were better able to discriminate between trials where base rates were a relevant source of information (standard problems) and trials where they were not (rule-inadequate problems), participants were divided into two subgroups based on their CRT performance. The low-reflective thinking group included participants who gave one or more incorrect responses to the CRT (N = 63) and the high-reflective thinking group included participants who responded correctly to all three CRT problems (N = 23).
For low-reflective participants, there was a positive and large correlation (r = .65, p < .001) in the proportion of base-rate responses for standard and rule-inadequate conflict problems, whereas for high-reflective participants, this correlation was smaller and not statistically significant (r = .32, p = .138) (see Table 1). Hence, high-reflective thinking (as measured by the CRT) seems to discriminate slightly better between trials where base rates should (standard problems) and should not be used (rule-inadequate problems). Furthermore, when comparing base-rate responses between standard and rule-inadequate problems, high-reflective participants showed a higher frequency of base-rate responses for standard conflict problems compared to rule-inadequate conflict problems (difference of .51) than low-reflective participants (.26). See Table 4 for details.

Response Times
Response times (converted to log10) were analyzed separately for each kind of problem (see Appendix B, Table A2 for means in ms).

Discussion
In Experiment 3, we found differences in the proportion of base-rate responses as a function of the presentation order of the base rates. Specifically, presenting the crucial information about the number of individuals actually tested after the base-rate information and just before participants were prompted to give their response increased the likelihood of its use and thus helped reduce the reliance on irrelevant base rates. In contrast, there were no changes in the proportion of base-rate responses as a function of order of presentation for standard problems. This is also expected since, in these problems, the last two pieces of information converge in establishing the initial group composition as the problems' relevant base rates (it makes no difference to say "all were tested/from a group of X and Y" vs. "from a group of X and Y/all were tested").
Still, the positive correlation observed in the previous studies between the proportion of base-rate responses to standard and rule-inadequate trials was replicated for participants in the different information orders, which is congruent with an overgeneralization bias. In addition, highly reflective participants showed this tendency to a lesser degree, which suggests individual differences in the expected direction concerning the relation between rationality (as measured by the CRT) and the discriminate use of statistical information.
As in Experiment 2, response-time analysis of standard trials provided evidence of conflict detection, both when participants gave the stereotype-based responses and when they responded according to base rates. Interestingly, the same pattern of results emerged for rule-inadequate problems. Participants detected a conflict between stereotype-based and base-rate information, even when base rates were made irrelevant to respond to the problem. In other words, conflict detection does not seem to discriminate between valid and irrelevant statistical information.

Experiment 4
Experiment 4 was designed to replicate and extend the results of Experiments 2 and 3, with the following modifications. First, to make sure that all participants understood the group compositions of the people actually tested in the rule-inadequate trials, the numbers of Xs and Ys in the subsample were explicitly stated (e.g., "in the first day of the study only 5 politicians and 5 nannies were interviewed"), thus overcoming any potential ambiguity of the formulations used in the previous experiments.
Furthermore, a measure of confidence was added as an additional indicator of conflict detection. After responding to each trial, participants were asked to express the degree to which they were confident in their response. If the opposition between stereotype-based information (traits) and base rates was detected, then the confidence in responses to conflict trials should be lower than the confidence in responses to no-conflict trials (e.g., Bago and De Neys 2017).
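The logic of this confidence-based indicator can be illustrated with a minimal sketch. The trial structure (`conflict`, `confidence`), the function name, and the ratings below are hypothetical and serve only to show the direction of the predicted difference.

```python
def confidence_conflict_index(trials):
    """Mean confidence on no-conflict trials minus mean confidence on conflict trials.

    A positive value is the pattern expected if the conflict was detected.
    """
    conflict = [t["confidence"] for t in trials if t["conflict"]]
    no_conflict = [t["confidence"] for t in trials if not t["conflict"]]
    if not conflict or not no_conflict:
        raise ValueError("need at least one trial of each type")
    return sum(no_conflict) / len(no_conflict) - sum(conflict) / len(conflict)


# Hypothetical confidence ratings from one participant.
trials = [
    {"conflict": True, "confidence": 6},
    {"conflict": True, "confidence": 7},
    {"conflict": False, "confidence": 9},
    {"conflict": False, "confidence": 8},
]

print(confidence_conflict_index(trials))
```

The same index can be computed separately for standard and rule-inadequate trials; a positive value in both cases would mean that confidence drops under real and under spurious conflict alike.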

Participants
Sixty-four participants from the University of Lisbon (54 females) performed the experiment in exchange for course credit. Sensitivity analysis with this sample size (at α = .05, and power = .80) showed that the experimental design could reliably detect medium or larger effect sizes (f ≥ .42).

Material
The same 44 pairs of social groups from the previous experiment were used (translated to Portuguese).

Procedure
The procedure and design were the same as in Experiment 3, except that (a) the distinction between standard and rule-inadequate problems was created by the inclusion of one screen that established whether the target person in the problem was randomly selected from the total sample described in the group composition information screen ("in the first day of the study, all politicians and nannies were interviewed"), or rather from a subset of the sample where base rates are equal ("in the first day of the study only 5 politicians and 5 nannies were interviewed"); (b) after each trial, participants had to indicate how confident they were in their responses on a 9-point rating scale (from 1 = not at all confident to 9 = totally confident); and (c) the final CRT task had one extra problem (see Appendix A).
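Why the group composition becomes irrelevant in rule-inadequate problems can be made explicit with Bayes' rule in odds form. The sketch below is illustrative only: the group sizes for the standard case (4 vs. 996) are loosely modeled on the appendix examples, and the likelihood ratio of 10 (how much more probable the description is for one group than for the other) is an arbitrary assumption, not a value from the study.

```python
def posterior_probability(n_a, n_b, likelihood_ratio):
    """P(target is from group A | description), using Bayes' rule in odds form.

    n_a, n_b: number of group A and group B members in the pool from which
        the target was actually drawn.
    likelihood_ratio: P(description | A) / P(description | B).
    """
    if n_a <= 0 or n_b <= 0 or likelihood_ratio <= 0:
        raise ValueError("all arguments must be positive")
    posterior_odds = (n_a / n_b) * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


LR = 10  # assumed: the description fits group A ten times better than group B

# Standard problem: the target is drawn from the whole sample (4 A vs. 996 B).
p_standard = posterior_probability(4, 996, LR)

# Rule-inadequate problem: the target is drawn from an equal subsample (5 vs. 5),
# so the prior odds are 1 and only the description matters.
p_rule_inadequate = posterior_probability(5, 5, LR)

print(f"standard: {p_standard:.3f}")
print(f"rule-inadequate: {p_rule_inadequate:.3f}")
```

Under these assumptions, the extreme base rate outweighs the description in the standard problem, whereas in the rule-inadequate problem the stereotype-consistent answer is the more probable one regardless of the original group composition.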

Results
Six participants who responded to standard no-conflict problems with accuracies below .80 were removed from the analysis. The mean proportion of base-rate responses for conflict and no-conflict problems for standard and rule-inadequate problems is presented in Figure 6.
Summing up, the analysis of base-rate responses replicated that of the previous experiments. As in Experiment 3, when the subsample of participants actually tested was the last information presented, participants relied less on irrelevant base rates (in rule-inadequate problems).
Notably, however, the proportions of base-rate responses to standard and rule-inadequate trials showed a positive and large correlation (r = .71, p < .001), suggesting an overgeneralization bias (see Table 1). The same positive correlation was observed for participants in the subsample-last condition (r = .58, p = .001) and in the base-rates-last condition (r = .86, p < .001).
Unfortunately, it was not possible to compare subgroups of high and low cognitive reflection because thirty-three participants (56.90%) answered all CRT problems incorrectly and only two answered all of them correctly.

Response Times
RTs (converted to log10) were analyzed separately for standard and rule-inadequate problems (see Appendix B, Table A2 for means in ms).

Confidence
Confidence ratings were analyzed separately for standard and rule-inadequate problems.

Discussion
The pattern of results replicated that of Experiment 3. Presenting the number of individuals tested after the group composition information and just before participants were prompted to give their response seems to have made this crucial information more salient to the participants, improving their performance in rule-inadequate trials. Importantly, the correlation between the proportion of base-rate responses to standard and rule-inadequate trials observed in Experiments 1 to 3 was even more pronounced in Experiment 4 (in terms of Cohen's effect sizes, three of the reported correlations were large, and one, in Experiment 1, was medium-to-large).
Response-time and confidence-rating analyses of standard trials provided partial evidence of conflict detection both when participants opted for the stereotype-based response and when they responded according to base rates. A similar pattern of results emerged for rule-inadequate trials.
In sum, these results converge with those of Experiments 1 to 3 and show that participants have difficulties in discriminating between relevant and irrelevant base-rate information. This may lead participants to misuse statistical information and draw conclusions based on irrelevant base rates.

General Discussion
In four experiments using two different experimental procedures (moving windows and rapid-response paradigm), we partially replicated the key results of previous research (De Neys and Glumicic 2008; Pennycook et al. 2015). Specifically, we found increased response doubt reflected in terms of longer response latencies, a stronger tendency to revisit the initial base-rate information, and lower confidence ratings (lower feeling of confidence) for responses to conflict versions of standard base-rate problems (when compared to no-conflict problems). This occurred both for base-rate responses and for stereotype-based responses.
These results support the notion that the opposition, in conflict problems, between stereotype-based and statistical information (i.e., base rates) is detected and possibly triggers more deliberate processing.
Interestingly, the pattern of results for rule-inadequate problems mirrored the one just described for standard base-rate problems. Stereotype-based and base-rate responses to conflict versions of rule-inadequate problems (where the base rates opposing the stereotype-based information were irrelevant) also took longer and were given with less confidence than responses to no-conflict versions of the same problems. In other words, participants' sensitivity to conflict, as reflected by these measures (response times and confidence ratings), does not seem to discriminate a real conflict between stereotype-based and valid base-rate information from a spurious conflict between stereotype-based information and irrelevant base rates. This is not to say that participants simply did not distinguish between the two versions of standard and rule-inadequate base-rate problems. In fact, participants responded more often according to base rates when they were a valid source of statistical information (standard problems) than when base rates were irrelevant (rule-inadequate problems). In addition, manipulation of the order of presentation of the information (i.e., presenting the subsample actually used after the base rates and before participants were prompted to respond) further reduced the reliance on base rates on rule-inadequate problems.
However, a large positive correlation was consistently observed between base-rate responses to standard and rule-inadequate versions of the base-rate problems. This correlation suggests a tendency to use statistical information (base rates) even when it ceases to provide relevant statistical information. This overgeneralization bias is attenuated among high-reflective participants (as measured by the CRT), which is congruent with the notion that inaccurate rule-based reasoning can be a critical source of response bias.

Inadequate Use of Decision Rules in Reasoning about Everyday Problems
Rule-inadequate versions of standard base-rate problems were quite useful in disentangling adequate from inadequate use of base rates in responding to these problems.
As mentioned above, we are not the first to resort to this type of research strategy. The rationale for our approach was inspired by previous work by Fong et al. (1986). These authors developed so-called false-alarm versions of reasoning problems opposing statistical information (e.g., large but biased samples) to stereotype-based information, to test the effects of brief statistical training on the ability to use statistical principles such as the law of large numbers. By adding these false-alarm problems to their dependent measures, Fong et al. were able to show that statistical training increased the adequate use of the law of large numbers without leading to its over-application to situations where it was not called for. Klauer and Singmann (2013) used rule-inadequate versions of syllogistic reasoning problems, which they referred to as "pseudo-syllogisms" 7, to further test Morsanyi and Handley's (2012) argument that logicality of syllogisms can be detected in an intuitive manner via changes in affective state (conceptual fluency). They showed that the surface features of the problems were confounded with their logical status and that these features, rather than an implicit intuitive access to the logically correct solution, drove participants' responses.
Likewise, the rule-inadequate versions of the base-rate problems used in the research here reported make the base-rate information irrelevant for the judgment, while leaving the surface features of the problems largely unchanged (e.g., information concerning group composition is still presented). In this sense, rule-inadequate problems may be useful for determining whether conflict-detection effects are driven by surface features that happen to covary with the normative implications of the problems (i.e., the spurious conflict between invalid base rates and stereotype-based information), or whether the normative implications themselves are causally responsible.
Analogous to the work of Klauer and Singmann (2013) in syllogistic reasoning, the use of what we referred to as rule-inadequate problem versions (i.e., false-alarm problems) revealed that measures of conflict detection do not necessarily reflect participants' sensitivity to normative prescriptions. This may contribute to a better characterization of the interaction between different T1 intuitions and T2 thinking.

Contributions to the Debate among Dual-Process Models of Reasoning
Two prevalent notions in the dual-process literature of reasoning are (a) the view that T2 thinking is responsible for the application of relevant rules of logic and probability that often correct T1 outputs (e.g., Kahneman 2003); and (b) that even when opting for the heuristic-based response in classic reasoning tasks, people have competing logical intuitions (i.e., intuitive access to logical and probabilistic principles) that point to the normatively correct response in classic reasoning tasks, such as base-rate problems (e.g., De Neys 2012). Previous research (Klauer and Singmann 2013) and our results suggest that both of these notions may need to be qualified. Indeed, the present findings suggest that a sizeable fraction of those individuals who appear to respond in an adequate fashion (i.e., according to the normative implications of base rates) in standard base-rate problems may often also rely on statistically irrelevant information rather than making an adequate use of formal rules. In the same vein, processing measures such as (longer) response times and (decreased) confidence may point to the existence of faulty logical intuitions (opposing heuristic ones). This might occur if logical intuitions are at least sometimes driven by salient surface features (e.g., extreme base rates opposing stereotype-based information) that happen to co-vary with sound probabilistic judgment in the case of standard base-rate problems but not in the case of rule-inadequate versions of these problems.
In other words, what have been termed logical intuitions (e.g., De Neys 2012) may not always follow valid probabilistic or logical principles. These intuitions may sometimes merely conform to these principles in standard base-rate problems (though, to be sure, some studies offer compelling evidence for logical intuitions; e.g., Frey et al. 2018; Šrol and De Neys 2021).
Inferring the sound use of statistical rules from normatively correct responses to standard conflict problems might therefore be unwarranted when the tendency to use base rates, even when they are not a relevant source of statistical information (overgeneralization bias), is not considered.
Further research is certainly needed to clarify the conceptual implications of the current findings. The two-response procedure developed and validated by Thompson et al. (2011) could be used to better examine the time course of intuitive and deliberate processing in rule-inadequate problems. In this procedure, participants are asked to give two consecutive responses. At time 1, they are encouraged to respond intuitively and are put under a deadline to minimize the role of analytic thinking. They then provide a feeling of rightness (FOR) rating, followed by a second answer, during which reflection is encouraged. Two measures of T2 engagement are obtained: the length of the rethinking period and the probability of changing the initial answer. To what extent faulty logical intuitions would emerge at time 1, and to what extent the associated FOR would predict T2 engagement and the confirmation or correction of the initial response to rule-inadequate problems, are interesting research questions waiting to be explored.
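The two measures of T2 engagement mentioned above could be computed along the following lines. This is a sketch under stated assumptions: the trial fields (`first`, `second`, `rethink_ms`) and the data are hypothetical, and nothing here reproduces the scoring used by Thompson et al. (2011).

```python
def t2_engagement(trials):
    """Return (mean rethinking time in ms, proportion of changed answers)."""
    if not trials:
        raise ValueError("need at least one trial")
    mean_rethink = sum(t["rethink_ms"] for t in trials) / len(trials)
    n_changed = sum(1 for t in trials if t["first"] != t["second"])
    return mean_rethink, n_changed / len(trials)


# Hypothetical two-response data: initial answer, final answer, and the time
# spent between the two responses.
trials = [
    {"first": "stereotype", "second": "base rate", "rethink_ms": 3000},
    {"first": "stereotype", "second": "stereotype", "rethink_ms": 1000},
    {"first": "base rate", "second": "base rate", "rethink_ms": 2000},
]

mean_rethink, change_rate = t2_engagement(trials)
print(f"mean rethinking time: {mean_rethink:.0f} ms")
print(f"proportion of changed answers: {change_rate:.2f}")
```

Computing both measures separately for standard and rule-inadequate problems would show whether spurious conflict prompts as much rethinking and answer change as real conflict.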
In conclusion, the present findings add to the growing body of evidence that challenges the traditional dual-process perspective (e.g., De Neys 2012; De Neys and Pennycook 2019; Newman et al. 2017; Thompson et al. 2018). However, they also call for a more nuanced interpretation of findings suggesting that people have intuitive access to accurate logic and probability principles (e.g., De Neys 2012; De Neys and Pennycook 2019). It may be the case that sometimes people have intuitive access to faulty logical principles, which may eventually lead to biased rule-based responses.

Institutional Review Board Statement: The studies were approved by the ethics committee of Faculdade de Psicologia, Universidade de Lisboa.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
Problems from the cognitive reflection test presented in the experiments In Experiment 2: 1-If 3 workers need 3 min to produce 3 toys, how long would it take 500 workers to produce 500 toys? 2-A TV and a DVD player cost, together, 110€. The TV costs 100€ more than the DVD player. How much costs the DVD player?
3-A virus is spreading in a computer. Every minute, the number of infected files double. If it takes 100 min to infect the whole system, how long does it take to infect half of the system? Added in Experiment 4: 4-John had both the 15 • best note and the 15 • worse note on the class exam. How many students, in total, has the class? Appendix B Table A1. Means and standard errors of base-rate responses for different problems in each experiment.

Table A3. Examples for all variations of problems used in Experiment 1.

Standard problems
In a study 1000 people were tested. Among the participants there were 4 women and 996 men. In the first week of the study, all men and women were interviewed. Jo was one of the people interviewed in the first week. Jo is 23 years old and is finishing a degree in engineering. On Friday nights, Jo likes to go out cruising with friends while listening to loud music and drinking beer. What is most likely? a. Jo is a man b. Jo is a woman

In a study 1000 people were tested. Among the participants there were 996 women and 4 men. In the first week of the study, all men and women were interviewed. Jo was one of the people interviewed in the first week. Jo is 23 years old and is finishing a degree in engineering. On Friday nights, Jo likes to go out cruising with friends while listening to loud music and drinking beer. What is most likely? a. Jo is a man b. Jo is a woman

Rule inadequate
In a study 1000 people were tested. Among the participants there were 996 women and 4 men. In the first week of the study, as many men as women were interviewed. Jo was one of the people interviewed in the first week. Jo is 23 years old and is finishing a degree in engineering. On Friday nights, Jo likes to go out cruising with friends while listening to loud music and drinking beer. What is most likely? a. Jo is a man b. Jo is a woman

Jo is 23 years old and is finishing a degree in engineering. On Friday nights, Jo likes to go out cruising with friends while listening to loud music and drinking beer. What is most likely? a. Jo is a man b. Jo is a woman

Figure A1. Examples for all variations of problems used in Experiment 2.
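To illustrate why the group composition is informative in the standard Jo problems but not in the rule-inadequate version, the following minimal Python sketch applies Bayes' rule to both cases. The two likelihood values are hypothetical placeholders chosen only for illustration (they are not estimates from the present experiments), and the function name `posterior_man` is likewise an assumption of this sketch.

```python
def posterior_man(n_men, n_women, p_desc_given_man, p_desc_given_woman):
    """Posterior probability that the described person is a man (Bayes' rule).

    n_men and n_women describe the pool from which the person was drawn.
    """
    prior_man = n_men / (n_men + n_women)
    prior_woman = 1.0 - prior_man
    joint_man = prior_man * p_desc_given_man
    joint_woman = prior_woman * p_desc_given_woman
    return joint_man / (joint_man + joint_woman)


# Hypothetical likelihoods: the description is assumed to be three times
# more typical of a man than of a woman.
P_DESC_GIVEN_MAN = 0.60
P_DESC_GIVEN_WOMAN = 0.20

# Standard problem: everyone was interviewed, so Jo is drawn from the
# full group of 4 men and 996 women.
standard = posterior_man(4, 996, P_DESC_GIVEN_MAN, P_DESC_GIVEN_WOMAN)

# Rule-inadequate problem: as many men as women were interviewed, so Jo is
# drawn from a 50/50 pool regardless of the group composition.
rule_inadequate = posterior_man(1, 1, P_DESC_GIVEN_MAN, P_DESC_GIVEN_WOMAN)

print(f"standard: {standard:.3f}, rule-inadequate: {rule_inadequate:.3f}")
```

Under these assumed likelihoods, the posterior probability that Jo is a man is below 0.5 in the standard problem, where the extreme base rate dominates, and above 0.5 in the rule-inadequate problem, where only the description discriminates between the options.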

1.
The base rates presented in the original version of the lawyer-engineer problem (70 lawyers and 30 engineers; Kahneman and Tversky 1972) were replaced in this example with more extreme base rates to ensure that, when the base rates are taken into account, the likelier response option is "Dan is a lawyer".

2.
Base rates and group composition coincide in the case of standard problems but not in the rule-inadequate problems, where base rates are 50/50. However, for ease of presentation, throughout the text we refer to "base-rate responses" for both standard and rule-inadequate problems. These are adequate responses in the case of standard problems and responses based on invalid base rates in the case of rule-inadequate problems.

3.
The numbers of participants in Experiment 1 and in the remaining experiments reported here were not predefined based on power analysis. Experiments were included in different experimental waves, which were run to completion. The obtained Ns are comparable to those of previous studies using the same experimental paradigms. Specifically, a total of 86 participants completed De Neys and Glumicic's (2008) Experiment 2 (the same number completed our Experiment 1). The number of participants who completed Pennycook et al.'s (2015) studies varied between N = 60 and N = 88, whereas in Experiments 2 to 4 reported here, it varied between N = 64 and N = 116. Sensitivity tests were run for all studies.

4.
By "correct responses to no-conflict problems" we mean responses according to the base rates and stereotype-based information, which converge on the same response option in these problems.

5.
We thank Gordon Pennycook for making available to us the sample of traits used in the studies in Pennycook et al. (2015).

6.
The asymmetry in the number of standard problems (twice as many as the rule-inadequate problems) resulted from a mistake in the code of the E-Prime program used to run this experiment. This issue was corrected in Experiment 3.